Systematic Identification of Novel Protein Domain
Families Associated with Nuclear Functions 

Tobias Doerks,1,2,4,5 Richard R. Copley,1,4 Jorg Schultz,1,2 Chris P. Ponting,3
and Peer Bork1,2

1 European Molecular Biology Laboratory, 69114 Heidelberg, Germany; 2 Max-Delbrueck-Center, 13092 Berlin, Germany;
3 Medical Research Council Functional Genetics Unit, University of Oxford, Department of Human Anatomy and Genetics,
Oxford OX1 3QX, UK

A systematic computational analysis of protein sequences containing known nuclear domains led to the 
identification of 28 novel domain families. This represents a 26% increase in the starting set of 107 known 
nuclear domain families used for the analysis. Most of the novel domains are present in all major eukaryotic 
lineages, but 3 are species specific. For about 500 of the 1200 proteins that contain these new domains, nuclear 
localization could be inferred, and for 700, additional features could be predicted. For example, we identified a 
new domain, likely to have a role downstream of the unfolded protein response; a nematode-specific signalling 
domain; and a widespread domain, likely to be a noncatalytic homolog of ubiquitin-conjugating enzymes. 



Large proteins are often composed of domains. These are 
polypeptide regions that adopt compact three-dimensional 
(3D) structures and are often found in diverse molecular con- 
texts (Janin and Chothia 1985). The independent evolution- 
ary histories of domains found within the same protein lead 
to an assumption that the domain is the fundamental unit of 
protein structure and function (Doolittle 1995). Domains are 
most readily observable in known 3D structures, but because 
of the relative paucity of available structural data, the major- 
ity of protein domain families have been identified first by 
sequence analysis. Many domains are 'genetically mobile', 
meaning that they can be found associated with different do- 
main combinations in different proteins. The term 'module' is 
sometimes used to distinguish between mobile domains and 
those that are invariably found in identical molecular contexts.

Sequence characterization of domain families represents
a first step toward the determination of their 3D structures 
and molecular functions. Domain identification from se- 
quence is usually performed on a case-by-case basis, by apply- 
ing a variety of automatic methods supplemented with care- 
ful manual analysis. The number of protein domain families 
characterized from sequence has been increasing steadily over 
the years and has led to the development of Web-based resources
such as SMART and Pfam (Schultz et al. 1998, Bateman
et al. 2000) for effective and reliable domain identification.

We have systematically searched for new domain fami- 
lies, using proteins annotated by the SMART (Simple Modular 
Architecture Research Tool) database of domains as our start- 
ing point. We have targeted our strategy to all proteins that 
contain at least one of 107 types of predominantly nuclear 
domains in the SMART collection. Crucial to our technique is 
the knowledge of known domain boundaries provided
by databases such as SMART and Pfam (Schultz et al.
1998, Bateman et al. 2000). Using sequence regions not cov- 
ered by previously characterized domains, we have searched 
for homologs in nonredundant sequence databases and used 
previously computed domain architectures to determine 
which of the initial search regions could correspond to new 
domain families. A manual analysis of the various candidate 
families led to the final characterization of novel domain 
types and their sequence borders. 
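
The first step of this protocol, extracting the sequence regions not covered by previously characterized domains, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `uncovered_regions`, the 1-based inclusive coordinate convention, and the minimum region length of 50 residues are all assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end); an assumed convention


def uncovered_regions(seq_len: int,
                      domains: List[Interval],
                      min_len: int = 50) -> List[Interval]:
    """Return stretches of a protein not covered by any annotated domain.

    min_len is a hypothetical cutoff: regions shorter than this are
    treated as linkers and not used to seed homology searches.
    """
    regions: List[Interval] = []
    cursor = 1  # first residue not yet known to be covered
    for start, end in sorted(domains):
        # Gap between the covered prefix and the next domain start.
        if start - cursor >= min_len:
            regions.append((cursor, start - 1))
        # Overlapping domains must never move the cursor backwards.
        cursor = max(cursor, end + 1)
    # Trailing region after the last annotated domain.
    if seq_len - cursor + 1 >= min_len:
        regions.append((cursor, seq_len))
    return regions


# Toy protein of 600 residues with three annotated domains (two overlap).
example = uncovered_regions(600, [(100, 180), (150, 260), (400, 450)])
print(example)
```

Each returned region would then serve as a query against a nonredundant sequence database; candidate families are the regions whose homologs occur in proteins with differing domain architectures.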



Classification of the Novel Domains 

The protocol described earlier revealed a variety of novel do- 
mains that could be classified into four broad categories: 

1. Fifteen novel domain families with representatives in di- 
verse molecular contexts in different species (Table 1, Part 
A). Of these, three have recently been described on sepa- 
rate occasions (Table 1, Part A, Callebaut et al. 2001; Clis- 
sold and Ponting 2001; Doerks et al. 2001). 

2. Three domain families were found to be specific to single 
or closely related species (Table 1, Part B). 

3. Seven further domain families are likely to be divergent 
members of previously recognized families, with weak (but 
not statistically significant) similarity to previously described
domains (Table 1, Part C). One of these, the BED domain, has
recently been published independently (Aravind 2000).

4. Three additional families were recognized as representing 
family-specific N or C-terminal extensions of previously 
known domains (Table 1, Part D). These regions always 
co-occur with a particular neighboring domain, although 
their domain context within the protein as a whole may 
vary. Because of their size, they are likely to have well- 
defined structures, but might only exist in the context of 
the domain that they extend. In three of these cases, the 
extension is only found in closely related species. We do 
Table 1. Table of Novel Domains



representative 
sequence 
Species (domain borders) 



Part A. — Domains Present in Different Species 



NEUZ 
ZnFJTF 



Jumonji related family 



Domain in chromatin 
remodeling, S1 domain
containing and Zinc 
finger proteins 

Proteins involved in 
regulation of nuclear 
pre-mRNA 

Different transcription and 
chromosome 
remodeling factors 

TBC, LysM and other 
proteins 

Protein kinases, UBA or
UBX domain containing 
proteins and glycanases 

Helicases and SANT 
domains 

Proline-rich, in 
spliceosome associated 

Trithorax and 
X-chromosome 
inactivating proteins 

Trithorax and 
X-chromosome 
inactivating proteins 

TBC, PH, FYVE and other 



DSRM or ZnF_C2H2 

domain containing 

proteins 
Domain in neuralized-like 

proteins 
Domain in transposases 

and transcription 

factors 



750 α/β DNA-binding, 35
Chromatin
modulation



60 α DNA-binding 30

220 a/e+0 Enzyme 30

60 α/β RNA-binding 25

70 α DNA-binding 20

60 α RNA- or 15
snRNP-binding

40 α/β Unknown 25

90 α/β Unknown 25



BRIGHT, jmjN Eu, 

PHD, FBOX, y, a, c, d, h 

LRR, C2, TPR 

PLAc, CXXCt, 

ZnF_C2H2 
S1, SH2, C2HC, Eu 

HhH y, a, c, d, h 



RRM, PWWP, 
SURPt, 
C-Patch 



AT_F 



PHD, 



+ 0 Metal-binding 20 



, BROMO, 
MBD 

TBC, LysM, R3H, y, a, c, d, h 
FBOX 

C2H2, UBA, TGc, y, a, c, d, h 
UBX, S_TKc, 
STYKc 

SANT, BROMO y, a, c, d, h 

DEXDc, HELIc 
SAP, C2HC y, a, c, d, h 

PHD, SET, PWWP a, c, d, h 

PHD, SET, PWWP a, c, d, h 

DENNt, TBC, c, d, h 
PLAT, PH, C1, 
FYVE, GST, 
SH3 

CHROMO, PHD, c, d, h 

TFSM2, 

DEXDc, HELIc, 

SANT, BROMO 
C2H2, DSRM c, d, h 

SOCS, RING, c, d, h 

SPRY, SH2 
KRAB, BTB a, d, h 



Q9VNA1§ 
(1163-1 325) 
Q9MAT3 
(323-386) 

P25439§ 
(501-573) 
016997 
(200-357) 



Q24742§ 
BAB 14033 



Q19299 
(199-321) 
Q9ZWT4 
(100-199) 



FBD Domain in FBOX and 

other domain 
containing plant 
proteins 

ZnF_PMZ Plant mutator transposase 
zinc finger domain 



SPK 



SET and PHD domain 
containing proteins and 
protein kinases 



Metal-binding 125 



SET, ICE p10f, 
ICE p20f, 
ZnF_C2HC, 
PHD, STYKc 



Table 1. Table of Novel Domains (Continued)



representative 
sequence 
Species (domain borders) 



BED zinc finger, Related 
to C2H2/C2H2 zinc 
fingers (based on 
pattern similarity) 

Catalytic domain of 
ctd-like phosphatases, 
related to phosphatase 
superfamily (based on 
pattern similarity) 

RING finger and WD 
repeat containing 
proteins and DEXDc 
helicases, related to the 
UBCc domain (revealed 
by hmm searches) 

Bromodomain 

transcription factors 
and PHD domain 
containing Proteins, 
related to archaeal 
histone-like 
transcription factors, 
defined by PFAM 
(revealed by PSI-Blast 
results with less 
significance 
(E = 0.041)) 

Zinc finger, PHD domain 
and WD repeats 
containing proteins, 
related to SANT 
domain (after the 

Q9SR68 bridges to 
SANT domains 
(E = 0.002)) 

Zinc finger in DBF-like 
proteins, related to 
C2H2 zinc fingers 
(revealed by pattern 
similarity and hmm 
searches, E value = 1 .4) 

C4-zinc finger and HLH 
domain containing 
kinase subfamily of 
choline kinases (after 



Part C. — Newly Recognized Divergent Subfamilies

vfeta] 50 AT.Hook, 

binding PTPc_DSPc 



α/β Phosphatase



DNA-binding 25 



S_TKc, RING, 
WD, UPF29f, 
DEXDc, HELIc 



DNA-or 60 C2H2, PHD, WD 

Protein-binding



α/β Enzyme 70 ZnF_C4, HLH,






a be modules, and 



Alignments of the novel domains, the proteins they are found 
in, and their phyletic distribution are publicly available in the 
SMART database (http://smart.embl-heidelberg.de/). 

Of the total 28 regions discovered, 8 were found by
simple single-pass blast searches. For the remaining 20,
psi-blast searches were necessary to provide statistically
significant links between proteins with different domain
architectures. This is broadly consistent with the reported
three-fold sensitivity of psi-blast over blast (Park et al.
1998).

Conserved protein domains are most useful when they 
can be used to make predictions of likely function. For the 
domains presented here, this was possible to varying degrees. 
We provide three examples to illustrate the more important 
categories described earlier, and show the types of (necessarily
Table 1. Table of Novel Domains (Continued)



Sec representative 
Length struct. Pred. No. of Associated sequence 
Domain Description (AS) pred. function proteins domains Species (domain borders) 



AWS Associated with SET 

domain, subdomain 
of PRESET* 
(hmm searches, 
E value = 0.52) 

POX Domain associated with 

HOX-domains 

PRE_C2HC Associated with zinc 



Part D. — Family Specific Extensions of Known Domains 

50 Histone 25 SET, PWWP, 

modification AT.Hook, WW, 

PHD, POSTSET, 



Q38897 
(199-337) 
044939§ 
(546-616) 



First column, domain name; second column, domain description (e.g., associated domains or well-described proteins); third column, approximate
domain length (number of amino acids); fourth column, secondary structure prediction (Rost et al. 1994) (α: domain consists of α-helices;
β: domain consists of β-strands; α/β: domain consists of α-helices and β-strands); fifth column, predicted function of novel domain; sixth
column, number of proteins containing the novel domain; seventh column, names of associated domains (domain names are according to the
Simple Modular Architecture Research Tool (http://smart.embl-heidelberg.de) (Schultz et al. 1998, 2000) or the domain is defined by Pfam
(Bateman et al. 2000)†); eighth column, species representatives containing the novel domain. Abbreviations: eu, eubacteria; virus, viruses;
y, yeast; a, Arabidopsis thaliana; c, Caenorhabditis elegans; d, Drosophila melanogaster; h, Homo sapiens. The ninth column gives the accession
number of the representative protein and the region of the detected domain in amino acids.
*Novel domain is accepted, in press, or published recently.
^Unpublished domain. 

§Additional HMM searches are needed to define all novel domain-containing proteins. 

The more conserved parts of the domains FYRN and FYRC were called ATA1 and ATA2 in human ALR protein (Prasad et al. 1997) and FYR 
(merged in one domain) in plant proteins (Balciunas and Ronne 2000), respectively. 



conjectural) functional information that can be inferred from 
the present identifications. 

A Widespread Module in Diverse Species: A Novel 
Domain in Peptide N-glycanases and Other Putative 
Nuclear Proteins 

The majority of our novel domains are found in diverse spe- 
cies and in different protein contexts without significant se- 
quence similarity to other domains. A particularly interesting 
example is described here. 

A hypothetical Arabidopsis protein (SpTREMBL accession:
Q9MAT3) is predicted to contain two N-terminal zinc finger
motifs (ZnF_C2H2), followed by a UBA domain (Hofmann
and Bucher 1996). A predicted coiled-coil region links this to
a C-terminal half that contains no currently described domains.
psi-blast searches initiated with this C-terminal region
show significant sequence similarity (E-value < 10⁻⁵) to
UBX domain-containing proteins and metazoan homologs of
peptide:N-glycanases (PNGases).

Searching of preliminary protein predictions from the 
Plasmodium falciparum genome, with the conserved region
identified in an Arabidopsis sequence (SpTREMBL accession 
no. Q9FKI1), revealed an additional association with a UBCc 
domain-containing protein (E-value, 5 x 10 '').

We refer to these homologous regions as PUG domains, 
after the Peptide:N-Glycanases and other putative nuclear 
UBA or UBX domain-containing proteins. PNGases are be- 
lieved to have a role in the unfolded protein response (UPR) 
(Suzuki et al. 2000). The UPR results in increased levels of 
transcription of endoplasmic reticulum (ER)-resident protein-coding
genes, following accumulation of unfolded proteins in
the ER. The PUG domain is found to co-occur in proteins with
three domains that are central to ubiquitin-mediated proteolysis:
UBA in Arabidopsis, UBCc in Plasmodium, and UBX in
mammals and Arabidopsis. This indicates that PUG domain-containing
proteins might link the UPR to ubiquitin-mediated
protein degradation. Other links between the UPR
and ubiquitin-mediated proteolysis have been shown previously
(Travers et al. 2000).

The candidate orthologs of PNGases in Saccharomyces
cerevisiae, Schizosaccharomyces pombe, and Arabidopsis do not appear
to encode this domain, indicating that its presence in PNGases
is a metazoan innovation. Of these putative PNGases,
only the S. cerevisiae protein has been directly characterized; it
appears to be localized to the nucleus, with a lower level
occurring in the cytosol (Suzuki et al. 2000). As the apparent
orthologs in metazoan genomes appear to be present singly, 
rather than as multiple paralogs (which might imply func- 
tional variation), it seems likely that the proteins encoded by 
them will have a similar cellular localization. 

Additional HMMer2 searches, using an HMM derived
from these PUG domain sequences, showed marginal similarity
to IRE1p-like kinases (SpTREMBL accession: Q9SHL6) (E-value:
0.21) within a region known to be homologous to the
C-terminal tail of 2'-5' oligo (A)-dependent ribonuclease
(Zhou et al. 1993) (see Fig. 1). Although of only marginal
significance, the similarity also extends to cellular function
because IRE1p-like kinases are known to initiate the UPR
(Shamu and Walter 1996). The C-terminal tail of IRE1p is
required for induction of the UPR (Shamu and Walter 1996),
and has been shown to possess site-specific endoribonuclease
activity (Sidrauski and Walter 1997). This activity is consistent
with the C-terminal location for RNase activity found in
its homolog, 2'-5' oligo (A)-dependent ribonuclease (Bork and
Sander 1993). Consequently, we tentatively suggest the presence
of divergent PUG domains in the C termini of IRE1p-like kinases.

Figure 1 (A) Multiple sequence alignment of PUG domains of N-glycanases (PNG1mm, PNG1dm),
UBX domain-containing proteins (F13M7, CG5469), HOX domain-containing proteins (F3M18,
MLN1), UBA/zinc-finger-domain-containing proteins (K24G6, T8011), and hypothetical zinc metalloproteinase
(MXH1) and multiple sequence alignment of PUG-like domains in serine/threonine protein
kinases / RNAses (F26K24, MJB20, K16H17, IRE1mm, ERN1, IRE1sc, YQG4, SPAC167, CG4583,
RN5Ahs, RN5Amm). First column, protein names; second column, species names (at, Arabidopsis
thaliana; ce, Caenorhabditis elegans; dm, Drosophila melanogaster; hs, Homo sapiens; mm, Mus musculus;
pf, Plasmodium falciparum; sc, Saccharomyces cerevisiae; sp, Schizosaccharomyces pombe); third
column, start of the domain in the respective sequences; rightmost column, database accession numbers.
Conserved positively charged residues are shown in pink; conserved hydrophobic residues are
shown in blue; other conserved residues are shown in bold. The predicted secondary structure taken
from the consensus of the alignments (B/H, strand/helix predicted with expected average accuracy
>82%; b/h, strand/helix predicted with expected average accuracy <82%) (Rost et al. 1994) is shown
below each alignment (consistent secondary structure in bold letters). The consensus sequence (conserved
in 80% of the sequences) for both alignments is shown below. s, l, h, p, c, -, L, N, and F indicate small,
aliphatic, hydrophobic, polar, charged, and negatively charged residues and conserved leucine, asparagine,
and phenylalanine, respectively. (B) Domain architecture of proteins containing the PUG domain (green) and the
PUG-like domain (green dark horizontal pattern). Only proteins with distinct modular organizations are
shown. The domain names are those of the Simple Modular Architecture Research Tool
(http://smart.embl-heidelberg.de) (Schultz et al. 1998, 2000). C2H2, zinc finger C2H2 DNA-binding domain;
PAW, domain in PNGases and other worm proteins; PQQ, β-propeller repeat; S_TKc, serine/threonine
protein kinase catalytic domain; TGc, transglutaminase/protease-like homologs catalytic domain; UBA,
ubiquitin-associated domain; UBCc, catalytic domain of ubiquitin-conjugating enzymes; UBX, domain
present in ubiquitin regulatory proteins; TM, transmembrane region.

Further analysis of the meta- 
zoan PNGase sequences revealed a 
conserved region that is also pre- 
sent in multiple copies in hypo- 
thetical Caenorhabditis elegans pro- 
teins (e.g., four copies in C17B7.5). 
This domain was not found in the 
initial rounds of searching because 
it does not occur with any of our 
starting set of nuclear domains. We 
have included this domain in the 
SMART collection and have named 
it PAW (domain present in PNGases 
and other worm proteins). 

Novel Modules Found in Narrow Phyletic Ranges:
A Nematode-Specific Putative Signaling Domain
in C. elegans

Lineage-specific expansions of pro- 
tein domain families (i.e., a large 
increase in the number of a particu- 
lar domain in one genome com- 
pared with other genomes) are a 
widespread phenomenon (e.g., 
International Human Genome Se- 
quencing Consortium 2001). In ex- 
treme cases, it may not be possible 
to establish links between a domain 
that is widespread in one organism 
and known domains seen in other 
species. Such cases may represent 
genuine 'invention' of new do- 
mains, or, perhaps more likely, in- 
stances where the tempo of mo- 
lecular evolution has risen to the 
extent that sequence similarity 
with known domains is no longer 
detectable. Alternative scenarios of 
massive loss from other lineages are 
less parsimonious. Three (i.e.,
~11%) of our new domains appear
to occur in very restricted phyloge- 
netic lineages; these exclude spe- 
cies-specific N- or C-terminal ex- 
tensions of known domains (see 
Table 1, Part B). 
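
The distinction drawn here between widespread and lineage-restricted domains can be expressed as a simple check on a domain's phyletic distribution. The sketch below is only illustrative: the species codes follow the abbreviations in the Table 1 legend, but the grouping of those codes into lineages and the function name `phyletic_class` are assumptions made for this example.

```python
from typing import Iterable

# Species codes as abbreviated in the Table 1 legend.
# The lineage labels they map to are an assumption for this sketch.
LINEAGE = {
    "eu": "eubacteria",
    "virus": "viruses",
    "y": "fungi",
    "a": "plants",
    "c": "nematodes",
    "d": "arthropods",
    "h": "vertebrates",
}


def phyletic_class(species_codes: Iterable[str]) -> str:
    """Label a domain by the number of distinct lineages it occurs in."""
    lineages = {LINEAGE[code] for code in species_codes}
    if not lineages:
        raise ValueError("no species recorded for this domain")
    return "lineage-specific" if len(lineages) == 1 else "widespread"


# A domain seen only in C. elegans versus one seen in all major lineages.
print(phyletic_class(["c"]))
print(phyletic_class(["y", "a", "c", "d", "h"]))
```

A label of "lineage-specific" from such a check reflects only the sequences available; as the text notes, undetectable divergence or incomplete genome coverage can produce the same pattern.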

psi-blast searching with 
the region C-terminal to a SET do- 
main (Cui et al. 1998) of the hypo- 
thetical protein Y43F11A.5 (Sp- 
TREMBL accession: Q9U2G8) de- 
tected a novel domain found in 
many different predicted proteins 
from C. elegans but thus far in no 
other species. The domain is ~120
residues in length, and found asso- 
ciated with the catalytic domain of 
caspases (CASc), protein kinases of 
undetermined specificity (STYKc), 



Genome Research 51 

www.genome.org 



Downloaded from genome.cshlp.org on September 23, 2009 -
Doerks et al. 



Published by Cold Spring Harbor Laboratory Press 




Figure 2 Domain architecture of proteins containing the SPK do- 
main. Only proteins with distinct modular organizations are shown. 
The domain names are those of the Simple Modular Architecture 
Research Tool (http://smart.embl-heidelberg.de) (Schultz et al. 1998,
2000). CASc, catalytic domain of caspases; PHD, PHD C4HC3 zinc
finger; SET, (Su(var)3-9, Enhancer-of-zeste, Trithorax) domain;
STYKc, catalytic domain of protein kinases. The UCH-2 (ubiquitin
carboxy-terminal hydrolase family 2) domain is defined by Pfam
(Bateman et al. 2000).

and the SET methyltransferase domain. Multiple tandem cop- 
ies of the domain may be present in the same sequence (Fig. 
2). We named this domain SPK [associated with SET, PHD 
(Aasland et al. 1995), protein Kinase]. The alignment is provided
on the Web (see http://www.embl-heidelberg.de/~doerks/alignment_fig3.html/).

Further analysis of nucleic acid sequence databases re- 
vealed SPK domains in the Caenorhabditis briggsae sequence, 
in regions for which no proteins have been predicted (e.g., 
NCBI GI:11095060, data not shown). No other species were 
found to contain the domain. It is possible that the domain 
exists in nematode lineages other than Caenorhabditis, but is 
simply not found due to insufficient sequence coverage of
these species. 

The association of SPK with SET, PHD, catalytic protein 
kinases or caspase domains (see Fig. 2) hints at an important 
role in metabolic, developmental, or evolutionary processes 
that are unique to Caenorhabditis. However, none of the putative
proteins in which the domain has been found have
been characterized by any experimental technique other than 
RNAi screening. All homologs tested by RNAi are wild type 
according to wormbase (http://www.wormbase.org/). This 
technique would not be expected to reveal more subtle phe- 
notypes associated with later developmental stages. 

Modules in New Contexts: A Noncatalytic Subfamily 
of Ubiquitin-Conjugating Enzyme Homologs 

The protocol presented here detects regions of homology be- 
tween sequences where no domains have previously been as- 
signed. Some of our newly identified regions appear to be 
distantly related to known domains, but correspond to new 
molecular contexts. Such cases indicate potential changes of 
domain function or add new insights to the function of the 
proteins in which the domain has been newly identified. An 
increasing number of known domains are being recognized as
members of wider superfamilies because of the availability of 
3D structures. For example, the UBX domain has recently 
been reclassified as a subfamily of the ubiquitin fold superfamily
(Buchberger et al. 2001). In addition to protein structure
determination, carefully applied sensitive sequence
searching methods can also provide such insights. This is ex- 
emplified by the following example detected in this study. 

The mouse GCN2 eIF2α kinase and histidyl-tRNA synthetase
(SpTREMBL accession: Q9QZ05) is an essential component
of translation control (lentsch et al. 1991; Sattleger et
al. 1998). A psi-blast search initiated with the region N-terminal
to an inactive protein kinase domain (see Fig. 3) in
the GCN2 protein revealed significant similarity to presumed
orthologs in other eukaryotic species from yeast to vertebrates.
Further psi-blast iterations and additional HMM
searches reveal significant similarity to WD-repeat-containing
proteins; yeast DEAD (DEXD)-like helicases; UPF0029, an uncharacterized
protein family from the Pfam database (accession
no. PF01205); a range of hypothetical proteins; and
many RING finger-containing proteins. We called the newly
defined region RWD after the better characterized RING finger
and WD-domain-containing proteins and DEAD-like helicases.
psi-blast searches initiated with different seeds also
revealed homology with the ubiquitin-conjugating enzyme
(UBCc) domain (e.g., SpTREMBL acc: Q94721 hits Q9SDY5 on
iteration 3, E-value = 9 x 10⁻⁴), although the catalytic cysteine
critical for ubiquitin-conjugating activity is not conserved
in most members of the novel subfamily (see
http://www.embl-heidelberg.de/~doerks/aligment_fig4.html/).
This observation is particularly interesting in light of previous
experimental studies on A07 (SpTREMBL accession: Q9QZU0),
a protein that includes both an RWD and a RING finger domain,
that have shown that a region between 85 and 363
amino acids in A07 (including the RING finger) binds ubiquitin-conjugating
enzyme E2 and acts as a substrate for E2-



Figure 3 Domain architecture of proteins containing the RWD do- 
main. Only proteins with distinct modular organizations are shown. 
The domain names are according to the Simple Modular Architecture 
Research Tool (http://smart.embl-heidelberg.de) (Schultz et al. 1998,
2000). DEAD (DEXDc)-like helicases superfamily (N-terminal do- 
main); HELICc, helicase superfamily (C-terminal domain); RING, RING 
finger domain; STYKc, protein kinases (unclassified specificity); UBA, 
ubiquitin-associated domain; WD40, WD40 repeats. The RING finger 
domain in the dashed box is not recognized by SMART or Pfam. The 
STYKc domain in the dashed box is degenerated (partial and non- 
catalytic). The IBR (In between Ring fingers) domain and UPF29 (un- 
characterized protein family) are defined by Pfam (Bateman et al. 
2000). The HisRS (histidyl-tRNA synthetase) domain is defined by 
literature (Sattleger et al. 1998). 
Predictions of Function

On the basis of reports in the literature and/or co-occurrence 
with previously identified domains, some functional features 
can be predicted for 78.6% of our newly identified set of 28 
domain families. This represents an increase in the state of 
functional prediction for ~700 proteins (i.e., the total number
of distinct proteins that are covered by novel domains with a 
putative function; see Table 1, Parts A-D). The predicted func- 
tions represent a variety of different cellular processes and 
molecular functions, such as DNA/RNA or metal binding and
protein-protein interactions.
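
The percentages quoted throughout the text follow directly from the family counts, and can be checked with a few lines of arithmetic. In the sketch below, the counts 107, 28, 15, 3, 7, 3, and 4 are taken from the text; the figure of 22 families with a predicted function is not stated explicitly and is an assumption back-calculated from the quoted 78.6%.

```python
KNOWN_NUCLEAR_FAMILIES = 107   # starting set (from the text)
NOVEL_FAMILIES = 28            # total novel families (from the text)
CATEGORY_COUNTS = {            # the four categories of novel domains
    "diverse contexts": 15,
    "species specific": 3,
    "divergent members": 7,
    "family-specific extensions": 3,
}
DISEASE_LINKED = 4             # families in disease-associated proteins
WITH_PREDICTED_FUNCTION = 22   # assumption: inferred from the quoted 78.6%


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The four categories should account for every novel family.
assert sum(CATEGORY_COUNTS.values()) == NOVEL_FAMILIES

print("increase over known set:", pct(NOVEL_FAMILIES, KNOWN_NUCLEAR_FAMILIES))
print("species specific:", pct(CATEGORY_COUNTS["species specific"], NOVEL_FAMILIES))
print("disease linked:", pct(DISEASE_LINKED, NOVEL_FAMILIES))
print("with predicted function:", pct(WITH_PREDICTED_FUNCTION, NOVEL_FAMILIES))
```

Rounded to whole numbers, these values agree with the 26%, 11%, and 14% given elsewhere in the text.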

Five further cases of function prediction are outlined as 
follows. 

Chromatin-Binding Domains 

The CSZ domain-containing protein SPT6 and orthologs regu- 
late transcription through establishment or maintenance of 
chromatin structure (Chiang et al. 1996; Winston 2001). A 
histone-binding capability for SPT6 has been experimentally 
confirmed (Bortvin et al. 1996). Here the CSZ domain is associated
with an S1 and two SH2 domains, which are unlikely to be
responsible for histone or chromatin binding. By this
process of elimination, we predict a histone- or chromatin-binding
function for the novel CSZ domain. The presence of
HhH motifs in some copies of the CSZ domain raises the alternative
or complementary possibility of a DNA- and/or RNA-binding
function.

We identified a novel domain as a tandem repeat in sev- 
eral hypothetical human proteins, as a single copy associated 
with PHD and TFS2M in the Drosophila gene CG6525, and as 
the Drosophila brahma and kismet genes. The kismet protein 
in Drosophila and its orthologs have been shown to be chro- 
matin-remodeling factors, required for segmentation and seg- 
mentation identity. Our domain includes a recently reported 
conserved region (BRK) in brahma and kismet that is thought 
to bind chromatin (Daubresse et al. 1999). We thus propose a
chromatin-binding function for the newly identified domain. 

Protein Interaction Domains 

Recent studies reveal the interaction of the RPR domain in 
protein Pcf11 with the C-terminal domain of the largest subunit
of RNA polymerase II (Yuryev et al. 1996). Consequently,
a similar function, or, less specifically, a protein-interaction 
function, is predicted for the RPR domain. 

PSP domains appear to be protein-binding domains. The 
PSP domain-containing protein Cus1p is a component of a
spliceosomal complex, associated with U2 snRNA (Gozani et
al. 1996). Cus1p interacts directly with the snRNP Hsh155p
by a region that overlaps with the PSP domain (Pauling et al. 
2000). 

The nuclear factor 90 (NF90) is a substrate and regulator 
of the eukaryotic initiation factor 2 kinase double-stranded 
RNA-activated protein kinase. The novel DZF domain in NF90 
overlaps with a region known as NF45 homology domain, 
which is assumed to be responsible for conformation estab- 
lishing of NF90 in the complex, where it may bind NF45 or 
other proteins (Parker et al. 2001). Thus, it is assumed that the 
DZF domain is a protein-protein interaction domain. 

Several other functional predictions for novel domains 
are proposed in Table 1. Even where no functional role is 
postulated, delineation of conserved domain boundaries pro- 
vides a starting point from which to undertake further experi- 
ments aimed at elucidating molecular function and cellular 



Predicted Localization of the Novel Domains 

Context can also be used to predict whether a novel domain 
is associated with a certain cellular localization. For example, 
some of our novel domains are only found with representa- 
tives from our initial set of predominantly nuclear domains 
(i.e., those used to seed the searching procedure). This logic 
indicates a putative nuclear function and role for 10 of the 
domain families presented here, representing ~500 proteins.
Others among the novel domain families are likely to have 
roles in both nucleus and cytoplasm. 
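
The context-based inference described above amounts to a set comparison: a novel domain is flagged as putatively nuclear when every domain it co-occurs with belongs to the nuclear seed set. The following sketch illustrates that logic only. The function name `exclusively_nuclear`, the placeholder domain names `X1` and `X2`, and the toy seed set are hypothetical and do not reproduce the 107 families used in the study.

```python
from typing import Iterable, List, Set


def exclusively_nuclear(novel_domain: str,
                        architectures: Iterable[List[str]],
                        nuclear_seed: Set[str]) -> bool:
    """True if all co-occurring domains of `novel_domain` are seed domains.

    A domain with no partners at all is not flagged, because there is
    no context from which to infer a localization.
    """
    partners: Set[str] = set()
    for arch in architectures:
        if novel_domain in arch:
            partners.update(d for d in arch if d != novel_domain)
    return bool(partners) and partners <= nuclear_seed


# Toy data: a small stand-in for the nuclear seed set (assumption).
seed = {"PHD", "SET", "BROMO", "SANT"}
archs = [
    ["PHD", "X1", "SET"],   # X1 only ever appears with seed domains
    ["X1", "BROMO"],
    ["X2", "SH3"],          # X2 also appears with a non-seed domain
    ["X2", "PHD"],
]
print(exclusively_nuclear("X1", archs, seed))
print(exclusively_nuclear("X2", archs, seed))
```

A domain like the hypothetical `X2`, which also occurs alongside non-seed domains, corresponds to the families the text describes as likely to have roles in both nucleus and cytoplasm.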

Novel Domains Related to Human Diseases 

Four (14%) of the newly discovered domain families and one 
of the family-specific domain extensions occur in proteins
whose deficiencies are implicated in severe human diseases. 
The respective genes or chromosomal regions are known to be 
responsible for cancer, neurodegenerative processes, or chro- 
mosomal aberrations (Table 2). 

Although the extent to which the domains themselves 
are responsible for the phenotypic effects observed with these
diseases is not known, the new domains are likely to assist in 
ascertaining the normal functions of these genes, and by im- 
plication, a better understanding of their dysfunction. 

DISCUSSION 

Some well-characterized signaling domains, such as SH2 or 
PH, are present in a huge number of proteins and occur in 
combination with a large number of other domains. The fact
that they are so widespread no doubt facilitated their early 
detection and characterization. Perhaps unsurprisingly, the 
domains found in the present analysis have more limited dis- 
tributions than examples such as those. Even so, each new 
domain is found, on average, in 4.0 different architectures in 
~30 proteins. More widespread domains have been detected
by our approach [e.g., the BRK domain occurs in more than 
seven different settings and a total of 30 proteins (Table 1, 
Part A)]. 
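
Figures such as the average number of architectures per domain can be derived by grouping proteins by their ordered domain composition. The sketch below shows one way to do this under stated assumptions: an architecture is taken to be the ordered tuple of domain names in a protein, and the function name `architecture_stats`, the protein identifiers, and the domain names `N1` and `N2` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def architecture_stats(proteins: Dict[str, List[str]],
                       novel: Set[str]) -> Dict[str, Tuple[int, int]]:
    """Map each novel domain to (distinct architectures, protein count).

    An architecture is assumed to be the ordered tuple of domain names
    in a protein; tandem repeats therefore count as a separate setting.
    """
    archs: Dict[str, Set[Tuple[str, ...]]] = defaultdict(set)
    counts: Dict[str, int] = defaultdict(int)
    for arch in proteins.values():
        for dom in set(arch) & novel:
            archs[dom].add(tuple(arch))
            counts[dom] += 1
    return {d: (len(archs[d]), counts[d]) for d in novel}


# Toy data set with two hypothetical novel domains.
toy = {
    "p1": ["PHD", "N1"],
    "p2": ["PHD", "N1"],
    "p3": ["N1", "SET"],
    "p4": ["N2"],
}
stats = architecture_stats(toy, {"N1", "N2"})
mean_archs = sum(a for a, _ in stats.values()) / len(stats)
print(stats, mean_archs)
```

Whether tandem repeats or domain order should define a distinct architecture is a design choice; the text does not specify how the reported averages were counted.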

Only three (11%) of the newly discovered domains are 
species specific; of these, two are limited to plants and one is 
nematode specific (Table 1, Part B). This could simply reflect 
the fact that even when species-specific pathways exist, pro- 
teins involved in them are likely to be recruited from preex- 
isting components. Alternatively, species-specific domains 



Table 2. Table of Novel Domains or Family-Specific
Extensions Which Are Putatively Correlated with Phenotypic
Dysfunctions

Domain name   Protein acc. No.(a)   Disease                                        OMIM Acc. No.(b)

AWS           O96028                Wolf-Hirschhorn syndrome (Stec et al. 1998)    602952
RWD           CAB88085              Monosomy 21 (Orti et al. 2000)
DNP           O70656                Malignant astrocytoma (Nakamura et al. 1998)
FYRN/FYRC     Q03164                Acute leukemia (Djabali et al. 1992)           159555

(a) Accession number of related protein.
(b) Accession number of disease in OMIM database.
may more likely be found only with other species-specific
domains, rather than with domains found in a large phyletic 
range, and so would be underrepresented in the results of the 
search methods applied here. 

In general, we cannot answer the question of whether 
the domains presented here have distant homologs that are 
not detectable using present methods (in common with any 
other new domain discovery report). The general evolution- 
ary principle of reuse of preexisting components indicates 
that this is likely. However, we believe that, even if this is the 
case, the domains presented here, by dint of considerable se- 
quence variation, are likely to have acquired new biological 
functions that are worthy of independent investigation. 

In conclusion, we have identified a total of 28 novel 
domain families, 4 of which have been independently re- 
ported in the recent literature. Some of the domains are likely 
to be found in proteins localized to the nucleus. The predicted 
functions range from enzymatic activities to nucleotide binding.
The systematic search for novel domains led to a 26% increase
over the known nuclear domains that have been discovered in the
last 15 yr, since the C2H2 zinc finger was first described
(Miller et al. 1985).

The novel domains were all detectable using standard 
search methods (i.e., PSI-BLAST), within default E-value 
thresholds. The novelty of our approach has been to search 
using all candidate sequences that could contain a new domain
of interest. In contrast, it would appear from our results 
that only using well-characterized sequences to search pre- 
vents the detection of some domains. 

Although the majority of domains reported here are 
present in a wide variety of species, indicating that they have 
crucial biological roles, they are, on average, present in fewer 
proteins than previously reported domains. Taken together 
with the increasing volumes of data being produced by ge- 
nome projects, targeted approaches to domain detection, 
such as those presented here, must have a role in enumerating 
the evolutionarily conserved components required for life. 

METHODS 

Definition of Nuclear Domains 

A subset of SMART database families represents domains often 
found in nuclear proteins, as defined by annotation in sequence
databases (Schultz et al. 2000). The computer program
Meta-A(nnotator) (Eisenhaber and Bork 1998), which assigns
protein localizations based on Swiss-Prot annotations, was used
to predict the most likely localization for 
a domain family. A domain family was included in this analy- 
sis if more than 80% of Swiss-Prot entries of proteins con- 
taining the domain were annotated by Meta-A as nuclear. By 
this method, 86 domains were assigned a nuclear location. 
Eleven suspected false positives were removed following lit- 
erature searches, and an additional 32 signaling domains with 
partial nuclear localization were added when literature 
searches could confirm this assignment. 
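
The selection rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline actually used: the function name `is_nuclear_family`, the input format (one localization label per Swiss-Prot entry), and the label strings are hypothetical. Only the 80% cutoff and the family counts are taken from the text.

```python
from typing import Sequence

NUCLEAR_FRACTION_CUTOFF = 0.80  # "more than 80%" of entries, as stated in the text


def is_nuclear_family(localizations: Sequence[str],
                      cutoff: float = NUCLEAR_FRACTION_CUTOFF) -> bool:
    """Return True if more than `cutoff` of the entries are labelled 'nuclear'.

    `localizations` holds one predicted localization label per Swiss-Prot
    entry containing the domain (hypothetical input format).
    """
    if not localizations:
        return False  # no annotated entries, so no basis for a call
    nuclear = sum(1 for label in localizations if label == "nuclear")
    return nuclear / len(localizations) > cutoff


# Bookkeeping of the family counts reported in the text.
auto_assigned = 86     # families passing the 80% rule
false_positives = 11   # removed after literature searches
added_signaling = 32   # signaling domains with partial nuclear localization
final_set = auto_assigned - false_positives + added_signaling
```

The count works out to 86 - 11 + 32 = 107 families, matching the size of the nuclear subset used in the analysis.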

Thus, a set of 107 predominantly nuclear domain families was
derived (see
http://www.embl-heidelberg.de/~doerks/nuclear_subset.html/).
Many domains, such as those with 
RNA-binding functions, are found in proteins that translocate 
between the cytoplasm and the nucleus or are found in both 
cytoplasmic and nuclear proteins. Consequently, some of 
these 'nuclear' domain families may contain cytoplasmic pro- 
tein representatives. However, according to our protocol, 
based on Swiss-Prot annotations, the majority of proteins 
containing these domains will possess a significant popula- 
tion in the nucleus. 



Automatic Screening for New Domains 

All proteins containing one or more domains represented in 
the nuclear subset were extracted from public sequence data- 
bases, and their complete domain structure characterized us- 
ing SMART. Regions not annotated using known SMART do- 
main models were extracted, along with their domain context 
(i.e., position in the protein relative to other domains). Inter- 
domain sequences shorter than 30 amino acids were regarded 
as less likely to represent novel globular domains (although 
such short domains do exist) and discarded. Noncontiguous 
regions of the same sequence were analyzed independently of 
each other. All of these sequence regions were then clustered 
into groups using a program of the SEALS package 
with a default single linkage clustering threshold of 50 bits 
(Walker and Koonin 1997). The longest member of each of 
these groups was filtered for coiled-coil and low complexity 
sequences (Lupas et al. 1991; Wootton and Federhen 1996) 
and then used to search a nonredundant sequence database, 
using the PSI-BLAST iterative search algorithm (Altschul et
al. 1997), with an E-value inclusion threshold of E < 0.001. 
Eight search rounds were performed, unless the database 
searching procedure converged in a prior iteration (see 
Altschul et al. 1997 for details of the PSI-BLAST procedure). 
The domain organizations of all homologs identified by
PSI-BLAST searches were retrieved from the precalculated SMART 
database. The homologous regions identified in the searches 
were considered as the candidate domain family. Candidate 
regions that were found in different domain contexts (see 
following) in different proteins indicated a possible novel 
module family. These families were analyzed further using the 
methods described as follows. 
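
Two steps of the screening procedure above, extracting unannotated regions and grouping them by single linkage, can be sketched as follows. This is a minimal sketch under stated assumptions and is not the SEALS or SMART code: the function names, the 1-based inclusive coordinates, the use of a precomputed pairwise bit-score table, and treating a score equal to the threshold as a link are all hypothetical. Only the 30-residue minimum length and the 50-bit threshold come from the text.

```python
from typing import Dict, List, Tuple

MIN_REGION_LENGTH = 30    # shorter regions are discarded (from the text)
LINKAGE_THRESHOLD = 50.0  # bits, single linkage threshold (from the text)


def unannotated_regions(seq_len: int,
                        domains: List[Tuple[int, int]],
                        min_len: int = MIN_REGION_LENGTH) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) regions not covered by any domain."""
    regions = []
    cursor = 1  # first residue not yet accounted for
    for start, end in sorted(domains):
        if start - cursor >= min_len:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if seq_len - cursor + 1 >= min_len:
        regions.append((cursor, seq_len))
    return regions


def single_linkage(ids: List[str],
                   scores: Dict[Tuple[str, str], float],
                   threshold: float = LINKAGE_THRESHOLD) -> List[List[str]]:
    """Group ids so that any pair scoring >= threshold shares a cluster."""
    parent = {i: i for i in ids}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), bits in scores.items():
        if bits >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())
```

Each noncontiguous region is returned separately, in line with the independent analysis of such regions described above. Single linkage means that two regions can share a group through a chain of intermediate matches even if their own pairwise score is below the threshold.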



Manual Confirmation and Refinement 
of Predicted Domains 

To be considered as a module (i.e., a genetically mobile do- 
main), homologous sequences were required to be present in 
at least two diverse molecular contexts ('domain architec- 
tures'). Domain architectures (i.e., the linear arrangement of 
domains within a protein) were predicted using the SMART 
and Pfam databases. When a sequence contained no predicted
domain other than that of the candidate family, this, too, was
regarded as a distinct architecture. When a sequence invariably
occurred either N- or C-terminal to a single known domain, it
was regarded as an extension of the known domain. 
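
One possible reading of these criteria is sketched below. It is an interpretation, not the procedure actually applied, which involved manual inspection: the placeholder name `CANDIDATE`, the tuple representation of an architecture, the decision to test for an extension before counting architectures, and the "unresolved" outcome are all assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

CANDIDATE = "CANDIDATE"  # hypothetical placeholder for the candidate region


def _neighbour(arch: Tuple[str, ...], offset: int) -> Optional[str]:
    """Domain directly beside the candidate (offset -1 or +1), or None."""
    i = arch.index(CANDIDATE) + offset
    return arch[i] if 0 <= i < len(arch) else None


def classify_candidate(architectures: Iterable[Tuple[str, ...]]) -> str:
    """Classify a candidate family from the architectures it occurs in.

    Each architecture is the N- to C-terminal tuple of domain names in one
    protein and is assumed to contain CANDIDATE exactly once.
    """
    distinct = set(architectures)
    if not distinct:
        return "unresolved"
    # Extension: the same known domain always sits on the same side.
    for offset in (-1, +1):
        neighbours = {_neighbour(arch, offset) for arch in distinct}
        if len(neighbours) == 1 and None not in neighbours:
            return "extension"
    # Module: at least two distinct architectures; a protein holding only
    # the candidate, (CANDIDATE,), counts as an architecture of its own.
    if len(distinct) >= 2:
        return "module"
    return "unresolved"
```

For example, a candidate seen both on its own and next to a known domain is called a module here, whereas one that always follows the same known domain is called an extension.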

Inaccurate prediction of gene structure (i.e., artificial fu- 
sion of adjacent genes) might lead to new domain architec- 
tures being counted spuriously. Domain architectures were 
inspected manually for such apparently erroneous fusions; for 
example, protein sequences containing both nuclear and ex- 
tracellular domains were excluded. Similarly, a sequence was 
discarded if it had no homologs of similar domain architec- 
ture, but instead was similar to several pairs of nonhomolo- 
gous proteins and each pair corresponded to the presumed 
erroneously fused gene. 
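
The first of these checks, flagging proteins that combine nuclear and extracellular domains, can be pictured with a short sketch. It is illustrative only, since the inspection described above was manual: the function name and the two domain sets passed as arguments are hypothetical, and the text does not specify which domains were treated as extracellular.

```python
from typing import Iterable, Set


def looks_like_fused_gene(domains: Iterable[str],
                          nuclear: Set[str],
                          extracellular: Set[str]) -> bool:
    """Flag a protein whose domains span both categories.

    `nuclear` and `extracellular` are caller-supplied sets of domain names
    (hypothetical inputs; the classification itself is not given here).
    """
    present = set(domains)
    return bool(present & nuclear) and bool(present & extracellular)
```

A flagged protein would then be set aside for manual inspection rather than counted as a new domain architecture.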

At this stage, multiple alignments were generated 
(Thompson et al. 1994) for all candidate domains. In conjunction
with known locations of domains and other sequence 
features (e.g., N and C termini, transmembrane regions), these 
were used to define the borders of the putative new domains. 
In 10 cases, HMM-based searches of databases using HMMer2 
(Eddy 1998) were needed to detect additional family mem- 
bers. The results were checked manually for consistency, with 
respect to amino acid conservation and phyletic distribution, 
to exclude false positives, which would be expected from our 
10 HMM searches, given the E-value threshold of 0.1. Newly 
detected sequences were incorporated into the alignment, 
and the search procedure iterated. When these further analy- 
ses led to the identification of distant, but significant, simi- 
larity to annotated Pfam or SMART domains, the candidate 
domain was not pursued further. In cases in which we were 
unable to connect a family to a known domain with signifi- 
cant sequence similarity, but in which hits with marginal 
similarity were present, we recorded the family as represent- 
ing possible divergent members of previously known protein 
domain families. 
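
The remark above that false positives would be expected from the 10 HMM searches follows from the meaning of an E-value: the number of hits expected by chance at or below that threshold in one search. The arithmetic is sketched below as a rough bound, assuming every search accepts hits right up to the threshold and, for the probability, that chance hits follow a Poisson distribution. Both assumptions and the function names are illustrative and are not taken from the text.

```python
import math


def expected_false_positives(n_searches: int, e_value_threshold: float) -> float:
    """Upper bound on chance hits expected across independent searches."""
    return n_searches * e_value_threshold


def prob_at_least_one(expected: float) -> float:
    """P(at least one chance hit), assuming a Poisson distribution."""
    return 1.0 - math.exp(-expected)


bound = expected_false_positives(10, 0.1)  # 10 searches at E-value 0.1
chance = prob_at_least_one(bound)
```

With 10 searches at a threshold of 0.1 the bound is about one chance hit in total, and the corresponding probability of at least one such hit is about 0.63, which is why the results were checked manually.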